-
Couldn't load subscription status.
- Fork 837
Add metrics for rejected queries in QFE #5356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metrics for rejected queries in QFE #5356
Conversation
pkg/frontend/transport/handler.go
Outdated
| } | ||
|
|
||
| var reason string | ||
| // 413, 422 or 429. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't think we need this comment :-) It's just redundant with the if statements.
pkg/frontend/transport/handler.go
Outdated
| ) | ||
|
|
||
| var ( | ||
| LimitTooManySamples = `query processing would load too many samples into memory` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be exported? I only see this being used in the transport package.
| * [CHANGE] Ingester: Creating label `native-histogram-sample` on the `cortex_discarded_samples_total` to keep track of discarded native histogram samples. #5289 | ||
| * [FEATURE] Store Gateway: Add `max_downloaded_bytes_per_request` to limit max bytes to download per store gateway request. | ||
| * [FEATURE] Added 2 flags `-alertmanager.alertmanager-client.grpc-max-send-msg-size` and ` -alertmanager.alertmanager-client.grpc-max-recv-msg-size` to configure alert manager grpc client message size limits. #5338 | ||
| * [FEATURE] Query Frontend: Add `cortex_discarded_queries_total` metric for throttled queries. #5356 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think cortex_failed_queries_total is a better name than cortex_discarded_queries_total based what we are doing in this PR. It doesn't make sense that a query can be "discarded".
using "failed" has the benefit that we can use the same metric name for 5xx later on -- just have different value for reason.
I would recommend to change this entry to
[FEATURE] Query Frontend: Add cortex_failed_queries_totalmetric for failed queries withreason label. #5356
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree cortex_discarded_queries_total is not a good name here. What I want to track is probably cortex_throttled_queries_total.
cortex_failed_queries_total is too general and I don't really want to track 5xx with this metric, 5xx can be of difference reasons and I feel it is more useful to track throttled queries for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, yea cortex_throttled_queries_total make more sense. But "throttled" in this context is still weird because it's supposed to mean "flow control", but what we are doing here is not flow control when chunk is too big.
How about cortex_rejected_queries_total?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the name of cortex_rejected_queries_total!
pkg/frontend/transport/handler.go
Outdated
| reasonTooManyRequests = "too_many_requests" | ||
| reasonTooLongRange = "too_long_range" | ||
| reasonTooManySamples = "too_many_samples" | ||
| reasonSeriesFetched = "series_fetched" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think it would be more consistent if we use too_many_series_fetched. I am thinking about cortex_discarded_queries_total{reason="series_fetched"} v.s. cortex_discarded_queries_total{reason="too_many_series_fetched"}; I think later one is more clear to reader.
However, the counter argument is that the label name become bigger and may have performance impact. In which case maybe we should change too_many_samples to samples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too_many_ is kind of redundant I feel as we usually don't limit too_few. I will try to rename a few but I have to keep too_many_requests and too_many_samples. It is not easy to find a better name
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
131cee9 to
bac7b1f
Compare
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🍺 🍻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🍺 🍻
What this PR does:
Add metric
cortex_rejected_queries_totalin QFE.Which issue(s) this PR fixes:
Fixes #5355
Checklist
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]